In this tutorial, we will learn how to set up a R Markdown document for reproducible statistical analysis/modeling, primarily using functions and syntax from the KnitR package. On Tuesday, we will ensure that everyone has downloaded the right data and packages, go over code chunk setup, and go over different ways to source/code analysis. On Thursday, we will learn how to ensure analysis is reproducibly random and how to efficiently run computationally-intensive analysis.
We will learn how to dynamically connect data gathering and source code to markdown documents, setup pipelines to rerun analysis and present results whenever compiled, and keep markdown documents up to date and reflecting changes made to the data or analysis in source code files.
Coding is hard. It takes a long time, and you’ll likely have to go back to the same analysis multiple times to tweak and change things. It is important to setup a good workflow early to ensure you’re not scrambling to make your research reproducible at the end of the process. You’ll be saving yourself a lot of time if your analysis is reproducible and your results presented every time you knit your document.
Through this tutorial, students will learn to:
Make sure you have the following packages: “knitr”, “rmarkdown”, “bookdown”, “formattable”, “kableExtra”, “dplyr”, “magrittr”, “prettydoc”, “htmltools”, “knitcitations”, “bibtex”, “devtools”, “tidyverse”, “ebirdst”, “sf”, “raster”, “terra”, “patchwork”, “tigris”, “rnaturalearth”, “ggplot2”. Below are the code chunks for loading packages and setting up the document from Ch. 2:
###~~~
# Load R packages
###~~~
#Create a vector w/ the required R packages
# --> If you have a new dependency, don't forget to add it in this vector
pkg <- c("knitr", "rmarkdown", "bookdown", "formattable", "kableExtra", "dplyr", "magrittr", "prettydoc", "htmltools", "knitcitations", "bibtex", "devtools", "tidyverse", "ebirdst", "sf", "raster", "terra", "patchwork", "tigris", "rnaturalearth", "ggplot2")
##~~~
#2. Check if pkg are already installed on your computer
##~~~
print("Check if packages are installed")
#This line below outputs a list of packages that are not installed. Which ones of the packages above are installed in computer, print packages that are not installed. Should print 0.
new.pkg <- pkg[!(pkg %in% installed.packages())]
##~~~
#3. Install missing packages
##~~~
# Use an if/else statement to check whether packages have to be installed
# WARNING: If your target R package is not deposited on CRAN then you need to adjust code/function
if(length(new.pkg) > 0){
print(paste("Install missing package(s):", new.pkg, sep=' '))
install.packages(new.pkg, dependencies = TRUE)
}else{
print("All packages are already installed!")
}
##~~~
#4. Load all required packages
##~~~
print("Load packages and return status")
#Here we use the sapply() function to require all the packages
# To know more about the function type ?sapply() in R console
# Just an easier way to load all packages
sapply(pkg, require, character.only = TRUE)
# debug = right-pointing arrow next to code will load in console so you can go line by line and debug. pressing this also seemed to resolve the CRAN needs a mirror error
# Generate BibTex citation file for all loaded R packages
# used to produce report Notice the syntax used here to
# call the function
knitr::write_bib(.packages(), file = "packages.bib")
# The .packages() function returns invisibly the names of all packages loaded in the current R session (to see the output, use .packages(all.available = TRUE)). This ensures that all packages used in your code will have their citation entries written to the .bib file. This packages is then used in the Appendix to cite the code used
Setup Chunk Options:
### Chunk options: see http://yihui.name/knitr/options/ ###
### Text output
opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE, include = TRUE)
## Code formatting
opts_chunk$set(tidy = TRUE, tidy.opts = list(blank = FALSE, width.cutoff = 60),
highlight = TRUE)
## Code caching
opts_chunk$set(cache = 2, cache.path = "cache/")
## Plot output The first dev is the master for the output
## document
opts_chunk$set(fig.path = "Figures_MS/", dev = c("png", "pdf"),
dpi = 300)
## Figure positioning
opts_chunk$set(fig.pos = "H")
Download data from google drive. For this chapter, we will be working with a couple of datasets. Mike is working on consolidating some different exploratory analyses he did at different times over the last year. He wants to get them in the same place and make sure that they are reproducible. One set of data comes from the supplementary material of a paper. It is a csv file that was pulled from one of the author of the paper’s github. The other dataset came from the r package ebirdst, which interfaces with data from the community science platform eBird. We’ll start with a dataset of “encounter rates” calculated from eBird data, which are calculated as the percentage of checklists that include a particular species in a given area (Schuetz and Johnston, 2019). This data has been analyzed for the United States by state, and Mike is getting ready to calculate encounter rates for species in Sonora, Mexico that are also found in Idaho. Before digging into the eBird data for Sonora, Mike wanted to get to know the dataset for Idaho so that he knows what species to look for in the Sonora data. Let’s get that data ready.
#Load Data
library(tidyverse)
# load csv file of data
state_enc_rate <- read.csv("01_raw_data/state.level.enc.rate.csv")
summary(state_enc_rate)
## common.name scientific.name state query.volume.state
## Length:31722 Length:31722 Length:31722 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 10.35
## 3rd Qu.: 8.00
## Max. :100.00
## encounter.state.normalized
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.3636
## Mean : 15.0947
## 3rd Qu.: 21.4322
## Max. :100.0000
head(state_enc_rate)
## common.name scientific.name state query.volume.state
## 1 Abert's towhee Melozone aberti Alabama 0
## 2 Abert's towhee Melozone aberti Alaska 0
## 3 Abert's towhee Melozone aberti Arizona 100
## 4 Abert's towhee Melozone aberti Arkansas 0
## 5 Abert's towhee Melozone aberti California 0
## 6 Abert's towhee Melozone aberti Colorado 0
## encounter.state.normalized
## 1 0.000000
## 2 0.000000
## 3 100.000000
## 4 0.000000
## 5 2.357214
## 6 0.000000
There is more data in this dataframe than we need, so lets do some transformation to get the data where we want it. Specifically, we are going to drop a column we don’t need, filter the data down to just Idaho, and filter for species with encounter rates higher than 50%.
# Get rid of uneccessary columns
state_enc_rate_clean <- state_enc_rate %>%
dplyr::select(-query.volume.state #not needed for analysis
)
view(state_enc_rate_clean)
# filter to just Idaho
state_enc_rate_clean_filtered <- state_enc_rate_clean %>%
filter(state == "Idaho", )
view(state_enc_rate_clean_filtered)
# get rid of species with no encounter rate
ID_enc_rate <- state_enc_rate_clean_filtered %>%
filter(encounter.state.normalized != 0)
# get ride of encounter rates below 50 to focus on commonly
# encou
ID_enc_rate <- ID_enc_rate %>%
filter(encounter.state.normalized > 50)
# check size of final dataframe
summary(ID_enc_rate)
## common.name scientific.name state
## Length:63 Length:63 Length:63
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## encounter.state.normalized
## Min. : 50.01
## 1st Qu.: 58.34
## Median : 69.38
## Mean : 71.40
## 3rd Qu.: 83.32
## Max. :100.00
# make a table for markdown output
kable(ID_enc_rate)
| common.name | scientific.name | state | encounter.state.normalized |
|---|---|---|---|
| American coot | Fulica americana | Idaho | 55.75646 |
| American kestrel | Falco sparverius | Idaho | 100.00000 |
| American robin | Turdus migratorius | Idaho | 72.07650 |
| American wigeon | Mareca americana | Idaho | 60.66470 |
| bank swallow | Riparia riparia | Idaho | 63.75313 |
| belted kingfisher | Megaceryle alcyon | Idaho | 50.01036 |
| black-billed magpie | Pica hudsonia | Idaho | 83.85025 |
| black-headed grosbeak | Pheucticus melanocephalus | Idaho | 57.82899 |
| Brewer’s blackbird | Euphagus cyanocephalus | Idaho | 66.23826 |
| Bullock’s oriole | Icterus bullockii | Idaho | 72.90967 |
| California quail | Callipepla californica | Idaho | 100.00000 |
| calliope hummingbird | Selasphorus calliope | Idaho | 92.01797 |
| Canada goose | Branta canadensis | Idaho | 69.74989 |
| Cassin’s finch | Haemorhous cassinii | Idaho | 78.77421 |
| Cassin’s vireo | Vireo cassinii | Idaho | 100.00000 |
| cedar waxwing | Bombycilla cedrorum | Idaho | 64.66979 |
| chukar | Alectoris chukar | Idaho | 52.27404 |
| cinnamon teal | Spatula cyanoptera | Idaho | 50.02885 |
| cliff swallow | Petrochelidon pyrrhonota | Idaho | 62.71355 |
| common goldeneye | Bucephala clangula | Idaho | 73.44130 |
| common merganser | Mergus merganser | Idaho | 77.87708 |
| common raven | Corvus corax | Idaho | 56.20139 |
| dark-eyed junco | Junco hyemalis | Idaho | 70.30395 |
| dusky flycatcher | Empidonax oberholseri | Idaho | 88.66510 |
| Eurasian collared-dove | Streptopelia decaocto | Idaho | 68.98302 |
| European starling | Sturnus vulgaris | Idaho | 60.81733 |
| evening grosbeak | Coccothraustes vespertinus | Idaho | 53.82014 |
| great horned owl | Bubo virginianus | Idaho | 99.39787 |
| Hammond’s flycatcher | Empidonax hammondii | Idaho | 100.00000 |
| house finch | Haemorhous mexicanus | Idaho | 60.00921 |
| killdeer | Charadrius vociferus | Idaho | 56.41970 |
| lazuli bunting | Passerina amoena | Idaho | 85.12329 |
| Lewis’s woodpecker | Melanerpes lewis | Idaho | 71.66795 |
| long-eared owl | Asio otus | Idaho | 66.75941 |
| MacGillivray’s warbler | Geothlypis tolmiei | Idaho | 76.69511 |
| mallard | Anas platyrhynchos | Idaho | 87.65577 |
| mountain chickadee | Poecile gambeli | Idaho | 63.37515 |
| northern flicker | Colaptes auratus | Idaho | 86.56977 |
| northern harrier | Circus cyaneus | Idaho | 71.13981 |
| northern pygmy-owl | Glaucidium gnoma | Idaho | 69.37824 |
| northern rough-winged swallow | Stelgidopteryx serripennis | Idaho | 58.52182 |
| northern saw-whet owl | Aegolius acadicus | Idaho | 73.17100 |
| olive-sided flycatcher | Contopus cooperi | Idaho | 55.03477 |
| pine siskin | Spinus pinus | Idaho | 78.22397 |
| prairie falcon | Falco mexicanus | Idaho | 73.68943 |
| red crossbill | Loxia curvirostra | Idaho | 52.45629 |
| red-breasted nuthatch | Sitta canadensis | Idaho | 89.29945 |
| red-tailed hawk | Buteo jamaicensis | Idaho | 83.58497 |
| red-winged blackbird | Agelaius phoeniceus | Idaho | 65.22796 |
| ring-necked duck | Aythya collaris | Idaho | 87.38486 |
| rock pigeon | Columba livia | Idaho | 51.20198 |
| sage thrasher | Oreoscoptes montanus | Idaho | 53.83167 |
| sharp-shinned hawk | Accipiter striatus | Idaho | 53.95897 |
| song sparrow | Melospiza melodia | Idaho | 58.15657 |
| spotted sandpiper | Actitis macularius | Idaho | 76.10001 |
| Swainson’s hawk | Buteo swainsoni | Idaho | 95.49056 |
| western grebe | Aechmophorus occidentalis | Idaho | 67.58516 |
| western kingbird | Tyrannus verticalis | Idaho | 53.46191 |
| western screech-owl | Megascops kennicottii | Idaho | 83.04902 |
| western tanager | Piranga ludoviciana | Idaho | 98.95006 |
| western wood-pewee | Contopus sordidulus | Idaho | 67.01473 |
| willow flycatcher | Empidonax traillii | Idaho | 57.60302 |
| yellow warbler | Setophaga petechia | Idaho | 67.49039 |
# create clean data folder and save new data file
dir.create("02_Clean_data", showWarnings = FALSE)
write.csv(ID_enc_rate, file = "02_Clean_data/ID_enc_rate.csv",
row.names = FALSE)
Although dynamically linking source code to markdown documents is the main goal of this chapter, sometimes it makes sense to include short code chunks directly into your markdown document. There are a variety of customizable options on how the code and its output are display in the final knitted document. We use code chunk arguments from the knitr package to choose the parts of our code chunks that are outputted in the final html document.
First, we will learn how to set up a code chunk. We use code chunk arguments from the knitr package to choose the parts of our code chunks that are outputted in the final html document.
Code chunk arguments:
For eval and echo, you can tell knitr which expressions in the chunk to include using numerical vectors. For example, eval = 1:2 will only include the 1st 2 expressions of code in the output document.
Take 5 minutes to discuss with the person sitting next to you. Given the same chunk of analysis, how would you set up your chunk arguments when the document is being sent to your advisor for review versus being sent to a journal for publication? Then share with the class.
For this tutorial, we will create a code chunk to compare the mean encounter rate of songbirds and waterfowl in the Idaho dataset.
What code chunk arguments should we use if we want to figure to show up in the output document but not the code?
First we are going to to do a little bit more data transformation to make a couple of objects of the Idaho encounter rate data grouped taxonomically for some descriptive statistics we will use later.
# Pull passerines from ID encounter rates list
passerine_enc_rate <- ID_enc_rate %>%
filter(common.name %in% c("American robin", "bank swallow",
"black-billed magpie", "black-headed grosbeak", "Brewer's blackbird",
"Bullock's oriole", "Cassin's finch", "Cassin's vireo",
"cedar waxwing", "cliff swallow", "dark-eyed junco",
"dusky flycatcher", "European starling", "evening grosbeak",
"Hammond's flycatcher", "house finch", "lazuli bunting",
"MacGillivray's warbler", "mountain chickadee", "northern rough-winged swallow",
"olive-sided flycatcher", "pine siskin", "red crossbill",
"red-winged blackbird", "sage thrasher", "song sparrow",
"western kingbird", "western tanager", "western wood-pewee",
"willow flycatcher", "yellow warbler"))
waterfowl_enc_rate <- ID_enc_rate %>%
filter(common.name %in% c("American coot", "American wigeon",
"cinnamon teal", "common goldeneye", "common merganser",
"mallard", "ring-necked duck", "western grebe"))
What do the chunk commands do:
We use code chunk arguments from the knitr package to choose the parts of our code chunks that are outputted in the final html document.
Code chunk arguments:
Take 5 minutes to discuss with the person sitting next to you. Given the same chunk of analysis, how would you set up your chunk arguments when the document is being sent to your advisor for review versus being sent to a journal for publication? Then share with the class.
For eval and echo, you can tell knitr which expressions in the chunk to include using numerical vectors. For example, eval = 1:2 will only include the 1st 2 expressions of code in the output document.
Sometimes you may want to have R code or output appear inline with the rest of the markdown document. These can be static or dynamic.
ID_enc_rate <- state_enc_rate_clean_filtered %>% filter(encounter.state.normalized!=0 )We will go over two other ways to include analysis in your R Markdown document: locally sourced modular analysis and url-sourced modular analysis. We will also go over the scenarios in which you might use each method.
You may have previously written code locally stored on your computer that you want to pull for an analysis. To do so, we will save that code in its own R file and use the source command from knitR to call it. This is most helpful in the early stages of your research, as this is not the best practice for ensuring reproducible science.
# Run main analysis
source("03_linked_scripts/Sonora_AMAV_migration_chron.R")
## eBird Status and Trends access key stored in: /Users/michaelkrzywicki/.Renviron
## Downloading Status Data Products for ameavo
## Data already exists, use force = TRUE to re-download.
AMAV_SO
Use the argument include = FALSE to run the analysis and produce objects that can still be used by other code chunks but will not show up in the output document.
Once you are further along in your research, you may want to make sure your file in publicly accessible to ensure reproducibility. You can host your file in a GitHub repository and use the source_url command from the devtools package to call it.
We have set up a pre-made GitHub file for this section.
source_url("https://raw.githubusercontent.com/rosean829/Rep-Sci-Chapter-9/main/03_linked_scripts/Idaho_AMAV_migration_chronology.R")
## ℹ SHA-1 hash of file is "dd74275f1b0a1301afdd9da8cf3db53c2fca5761"
## Downloading Status Data Products for ameavo
##
## Data already exists, use force = TRUE to re-download.
AMAV_ID
When an analysis produces simulations, a random number generator is generally used so that others can replicate your results. For this tutorial, we will do an exercise to learn the importance of using the set.seed command to ensure replicability.
You and the person next to you use the same distribution variables but diff vs. same set.seed. Do you get the same result? What about when you use the same set.seed?
# Set seed as 125
set.seed(125)
# Draw 1000 numbers
Draw2 <- rnorm(1000, mean = 0, sd = 2) # Summarize Draw1
# Draw 1000 other numbers
Draw1 <- rnorm(1000, mean = 5, sd = 1.5)
summary(Draw2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.2112 -1.4071 -0.1040 -0.1215 1.3155 5.6772
plot(Draw2)
plot(Draw1)
model <- lm(Draw1 ~ Draw2)
plot(model)
# Set seed as 150
set.seed(150)
# Draw 1000 numbers
Draw3 <- rnorm(1000, mean = 0, sd = 2) # Summarize Draw1
# Draw 1000 other numbers
Draw4 <- rnorm(1000, mean = 5, sd = 1.5)
summary(Draw2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.2112 -1.4071 -0.1040 -0.1215 1.3155 5.6772
plot(Draw2)